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Abstract 

This paper studies the mixing time of certain adaptive Markov 
Chain Monte Carlo algorithms. Under some regularity conditions, 
we show that the convergence rate of Importance Resampling MCMC 
(IRMCMC) algorithm, measured in terms of the total variation dis- 
tance is Oln^ 1 ), and by means of an example, we establish that in 
general, this algorithm does not converge at a faster rate. We also 
study the Equi-Encrgy sampler and establish that its mixing time is 
of order 0(n~ x l 2 ). 

Keywords: adaptive Markov Chain Monte Carlo, mixing time, total 
variation distance, Importance-Resampling Algorithm, Equi-Energy 
Sampler. 
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1 Introduction 

Markov Chain Monte Carlo (MCMCj methods have been widely applied in 
many areas for more than 40 years [16j |. In particular, they are successful 
when the target distribution ir is available only up to a normalizing constant. 
To sample from such a distribution, various MCMC algorithms have been 
developed. A typical MCMC algorithm is designed using a transition kernel 
P that has ir as stationary distribution. See for example Meyn and Tweedie 
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20( , Robert and Casella 2l[ , Roberts and Rosenthal [22j , and the references 
therein. 

Constructing a transition kernel to sample efficiently from a given distri- 
bution, although conceptually easy, is rather difficult in practice. The diffi- 
culty is that generic algorithms such as the Metropolis-Hastings algorithm 
requires a careful choice and tuning of the proposal kernel. The development 
of adaptive MCMC (AMCMC) methods stems partly from these difficulties. 
Instead of having a fixed Markov kernel P, at each round n an AMCMC 
algorithm selects a kernel Pg from a family of Markov kernels {Po}q£@, 

where the value (parameter) 6 n is computed based on possibly all the sam- 
ples generated up to time n, so that the transition kernel is automatically 
self-adapted. This approach is very appealing in practice, as it frees the 
users from parameter tuning, and provides a better exploration-exploitation 
balance in the performance. As a consequence, AMCMC algorithms often 
yield smaller asymptotic errors in Monte Carlo estimations. The theoreti- 
cal and methodological studies of AMCMC have drawn attentions of many 
researchers lately. See for example the survey by Atchade et al. 0| and the 
references therein. 

In this paper, we investigate the convergence rates of two AMCMC al- 
gorithms: the Importance Resampling MCMC (IRMCMC) algorithm intro- 



duced by Atchade [5j, and the Equi-Energy (EE) sampler by Kou et al. [171 ]. 
The IRMCMC algorithm is also referred to interacting annealing algorithm 
For the EE sampler, we actually focus on a simplified version, which is 



sometimes referred to as interacting tempering algorithm [11]. 

Throughout the paper we denote by {X n } nS N the random process gener- 
ated by either of these algorithms. A common feature is that in either case, 
the dynamics of {X n } n ^fq is driven by a sequence of random measures 6 n 
computed from an auxiliary chain {l^jneN- Most of the theoretical results 
available so far focused on the convergence of marginal distributions 

and on the law of large numbers: 

1 n 

lim — N f{Xi) = 7r(/) almost surely. 

n— >oo n 

i=l 

See for example Andrieu et al. 0,3, Atchade Ml and Fort et al. Hj (there 
is a mistake in the proof of 0], pointed out in loj). Central limit theorems 
for such AMCMC algorithms have only been considered recently by Fort 
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et al. 11] and Bercu et al. @>0|- I n short, introducing the auxiliary chain 
makes the stochastic process no longer Markov, which raises considerable 
technical difficulties. We point out that there are other classes of AMCMC 
algorithms, for which the parameters take values in finite dimensional s pac es 



(e.g. the adaptive Metropolis algorithm introduced by Haario et al. [131]). 
The analysis of such algorithms is out of the scope of this paper. 

In this paper, we study the convergence rate (or mixing time) of the IRM- 
CMC and EE algorithms. That is, we provide upper bounds on the distances 
between Cx n (the distribution of X n ) and the target distribution. This type 
of rate differs and complements rate of convergence obtained in central limit 
theorems. Mixing time results provide information on the burn-in time 
of the algorithm, whereas central limit theorems (such as those mentioned 
above) deal with the rate of convergence and fluctuations of Monte Carlo 
averages. Beside Andrieu and Atchade []| who considered AMCMC with a 
finite-dimensional parameter, and to the best of our knowledge, the mixing 
time of AMCMC has not been investigated so far. 

We show that the IRMCMC algorithm has convergence rate of order 
0{n~ l ). In particular, we also provide a simple example, for which the 
convergence rate has lower bound 1/n. That is, one should not expect a 
rate superior to 0(n _1 ) in general. We also show that for m-tuple IRMCMC 
(to be defined in section f2.4|) . the convergence rate is also within 0(n~ l ). 
For the EE sampler, under some regularity conditions, we show that the 
rate of convergence is 0(n _1//2 ) in terms of a slightly weaker norm than the 
total variation distance. Our results are qualitative, in that the constants 
in the rates are not explicit. But they clarify what is known about these 
algorithms, and suggest in particular that longer burn-in should be selected 
for the EE sampler. 

The rest of the paper is organized as follows. The remaining of the intro- 
duction gives a general description of the algorithms considered in the paper 
and introduces some notation. Section [2] is devoted to IRMCMC algorithm. 
The convergence rate is established in Section 12.11 and for multiple IRM- 
CMC in Section [2.4i Section [3] is devoted to EE sampler. The convergence 
rate is established in Section 13.11 A remark on the connection to parallel 
tempering is provided in Section [ 



1.1 A class of AMCMC algorithms 

We describe the general framework of AMCMC considered in this paper. 
Let X denote a general state space. An AMCMC algorithm is a stochastic 
process {(X n , Y n )} n >o in X x X, designed such that the main chain X n 
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converges to the target distribution 1 in a certain sense to be described 
precisely later. Let V denotes the set of all probability measures on X, and 
{Kg, 6 G V} a set of transition kernels on X. Let P be a transition kernel on 
X with invariant probability measure ir. Starting from P and {Kg, 6 G V}, 
we consider the family of transition kernel Pg given by 

Pg(x, •) = (1 - e)P(x, •) + eK e (x, •), 6 G V, x G X. 

The dynamics of the AMCMC algorithms considered in this paper can 
be unified as follows: given T n = u(Xq, . . . , X n , Yq, . . . , Y n ), X n+ \ and Y n+ \ 
are conditionally independent, and for all bounded measurable functions 
h:X^R, 

E x (h{X n+ i) | T n , {yfe}fe>„+i) = Eje (h(X n+ i) | JF n ) = P^Ji(X n ), almost surely 

where 6 n {-) = f n" 1 Y^j=i ^ (')> denotes the empirical measure associated to 
the auxiliary chain {yn} n >o- Each algorithm is determined by the choice of 
the kernels Kg. For the IRMCMC, 

} x w(z)dz 

where w(z) = dir / dny (z) (see Section [1.21 for our convention on ir and d7r), 
while for the EE, the following choice is made 

K e (x,A) = l A (x)+ [ (lA ^lM ) (l A (z)-l A (x))9(dz). 
Jx V 7r{x)ir Y {z)J 

In both cases, ny is an auxiliary distribution chosen such that it is relatively 
close to 7r, and easy to sample from (at least easier than tt). We assume 
that the evolution of the auxiliary train is independent of the main chain in 
the sense that for bounded measurable function h : X — >■ R, 

E(h(Y n+1 ) | F n ) = E(h(Y n+1 ) | Y ,...,Y n ), almost surely 

The description of the dynamics of the joint process {(X n ,Y n )} n >o is com- 
pleted by specifying the dynamics of Y n+ \ \ T n , which is either a Markov 
chain with target distribution iry, or the main chain of another adaptive 
MCMC with target distribution 7ry, not necessarily Markov. 

The rationale of these algorithms can be viewed as follows. For 9 = 
6* = 7ry , the Markov chain with transition kernel Pg t have nice mixing 
properties, due to the choice of ny ■ Unfortunately, however, it is not possible 
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to implement directly the kernel Pg t . The idea here therefore is (i) to run an 
auxiliary chain {Yn} n >o with limiting distribution 7Ty, so that the empirical 
measure 9 n converges to 9*, and (ii) to sample X n+ \ from 



which approximates PQ^(X n , •) as n — > oo. 
1.2 Notation 

We assume that the state space X is a Polish space equipped with a metric 
d, and B is the associated Borel <r-algebra. In addition, (X,B) is a measure 
space with a reference cr-finite measure, which we denote for short by dx. 
Let 7r and 7ry be probability measures on {X ,B). We assume that ir and 7ry 
are both absolutely continuous with respect to dx and with a little abuse of 
notation, we also use ir and iry to denote the density respectively. That is, 
we write Tr(dx) = ir(x)dx and similarly for 7ry. For a transition kernel Q, 

a measure v and a function h, we shall write i>Q(-) d = J v(dz)Q{z, •), and 

Qh(-) = fQ(-,dz)h(z). We denote 7TY, n the empirical measure associated 
to the auxiliary chain {Y n } nG ^ defined by 



At times, we also use the notation 6 n (-) to denote 7fy n (-)- For functions 
/ : X — > R, we write 



We let C denote general constants that do not depend on n, but may change 
from line to line. 

2 Importance Resampling MCMC 

We consider the importance-resampling Markov Chain Monte Carlo method 
described in Atchade [Hj. 

Algorithm 1 (IRMCMC). Fix e <E (0,1). Pick arbitrary X = x and 
Yq = Vo- Let P be an arbitrary Markov kernel with stationary distribution 



Pg (X n , •) = (!- e)P(X n , •) + eK~ e (X n , •) 




i=i 



7Ty,n(/) = 7fy,n(/) ~ 7Ty(/). 
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7r. At each round n, X n and Y n are conditionally independent given J- n -i, 
and 

P(X n _i,-) w.p. 1 - e, 
n -i(O w.p. e, 



X n I .Fn-1 ~ 



where n is the (randomly) weighted empirical distribution defined by 

an V s * m /ggg^jdg) 

^ Ej=i ^(^) J* w(2)^y,n(dz) 

with w(y) oc 7r(y) /vry(y) =: and #o = We assume |io|oo = f 

SUp^gA' \w(x)\ < OO. 

Remark 1. The assumption on the boundedness of w is not too restrict. 
Indeed, very often in practice, we have tt, the un-normalized density function 
of 7r as a bounded function, and set the auxiliary chain with stationary 
distribution 7ry oc 7ry obtained by 7?y = tt t with T £ (0, 1). In this case, 
w = tt/tty is bounded and thus so is w. 

Equivalently, for any bounded function / : X — >■ R, 

E(/(X n+ i) | T n ) = PgJ{X n ) almost surely, 

where for all probability measures 9 on X, Pg(x,-) is defined by 

P 9 ( x ,.) = (l-e)P(x,-) + e9(-). (2) 

For the time being, we make no particular assumption on the dynamics of 
the auxiliary chain {Y n } n >Q. 

2.1 Convergence rate of IRMCMC 

The following equivalent representation of Algorithm Q] is useful. Let the 
auxiliary chain be as above. Furthermore, let {Z n } n >o be a sequence of 
independent and identically distributed random variables with ¥(Z\ = 1) = 
1 — P(Zi = 0) = e. Furthermore, we assume that {Z n } n >o and {Y n } n >o are 
independent and for each n > 1, Z n and T n -\ are independent. Then, at 
round n, we can introduce Z n , and write the conditional distribution of X n 
given Z n ,F n -i as 



X n | Tn-l, Z n ~ 



P(x n _i,-) if z n = o 
e n _i(0 \iz n = \. 
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Define 

ro = 0, Tj+i = min{A; > n : Zk = 1} and n* = max{/c : < n} . (3) 

Observe that at each time > 0, conditioning on Yo,Y\, . . . , Y Tfe _i, X Tk is 
sampled from Tk -li independent of Xq, . . . , X Tk _\. Furthermore, Yq, . . . ,Y n 
are independent from n, . . . , r n * . Therefore, we first focus on 

7? n d = F(X n+1 G • | Z n+1 = 1) = E0 n (-) , n € N . (4) 

We first obtain a bound on the total variation distance \\r] n — vr|| TV . Recall 
that, given two probability distributions /i and u, the total variation distance 
|| ju — z^Hxv ^ s defined by: 

IIm- "Htv = o su p M/) - K/)! • ( 5 ) 
2 l/l<i 

For convenience, write 

B n = f \w\oo sup E7fy |n (/) + 2|t«|^ sup E (n Y) n(7)) 2 , n € N. (6) 
l/l<i l/l<i 

Recall that throughout we assume |w|oo < oo. 

Lemma 1. For all n € N, 

||??n — ^IItV 

< B n . (7) 

Remark 2. A special case of ||% — vr|| TV , when w = 1 (equal weights), is 
the so-called cesaro mixing time of {Y n } n >Q. See for example Levin et al. 



1& Chapter 6.6]. 



The proof of Lemma [T] is postponed to next subsection. We have no 
explicit requirement on 7ry, except that w = ixjixy is bounded. However, 
the convergence of Y\ , Y2 , . . . to ixy is implicitly ensured when we require 
further sup^^ E(7ry, n (/) - 7iy(/)) + sup^^ E(7ry; n (/) - 7ry(/)) 2 ->■ as 
n — > 00. Indeed, these rates yield an upper bound on the convergence rate 
of Cx n => 7r, as shown in the following theorem. We set Bq = B-\ = 1. 

Theorem 1. Consider {X n } n£ ^ generated from Algorithm^ Then, 

n 

||/:x„-vr|| TV <^(l-6r-^_ 1 . (8) 
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Furthermore, for any bounded measurable function f , 



E 



1 

— ^(/(X,) - *(/)) 

i=l 



2 




The proof of Theorem [T] is postponed to next subsection. 

Remark 3. The control of the total variation distance depends on the 
initial position of the auxiliary chain, as in general, B n depends on the 
initial position Yq = uq. We omit this dependence throughout this paper. 
At the same time, it is obvious that the initial position Xq = xq is irrelevant. 

Remark 4. In Theorem [IJ we do not assume any ergodicity assumption on 
the kernel P. In the case P is say, geometrically ergodic, one can improve 
(|8) quantitatively by bounding the term \\r]kP n ~ k — tt j | TV more effectively. 
For example, if P is uniformly ergodic with rate p, then (|5|) would become 

n 

\\£>X n -7t|ItV < Yl ^ ~ ^F^Bt-l- 

A similar improvement can be formulated for 0. However, these improve- 
ments do not change the rate but only the constant in the corollary be- 
low. Beside, such improvements will not be easily available if P is sub- 
geometrically ergodic. 

Now, as a corollary we obtain an upper bound on the convergence rate 
of IRMCMC algorithm, under the following assumption. 

Al There exist a finite constant C such that for all measurable function 
/: X^R, with |/U < 1, 

E£y, n (/)<- and E (5fy,„(/)) 2 < -. (10) 
n n 

Remark 5. For example, if {l^} n eN is a Markov chain with transition 
kernel Py and stationary distribution 7ry, then the first part of (jlOp holds if 
P is geometrically ergodic J22}]. The second part of (jlOp essentially assumes 
that the sample variances of {f(Y n )} n( zfq is bounded, which also holds for 
geometrically ergodic Markov chains under appropriate moment condition 
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on / (e.g. tt(|/| 2+<: ) < oo with e > 0). Occasionally this condition can 
fail, as the sample variance might become infinity in the limit. See for 
example Haggstrom [l4| and Haggstrom and Rosenthal 15]. 

Corollary 1. Consider the importance resampling MCMC (Algorithm^). 
If Assumption \(A 1)\ holds, then there exists a finite constant C such that 

C 



\£>X n -7T| 



TV 



< — 

n 



Furthermore for any bounded measurable function f, 

2 



E 



1 



(/)) 



<C\f\l,nen. 



2.2 Proofs of Lemma Q] and Theorem [T] 

Proof of LemmaUl Fix n > 1 and write ny = T^Yn- Rewrite rj n (f) as, 



Vn(f) 



E 



E 7ry,n(W) + (7Ty(w) - Tf Y , n (w))6 n (f) 



where in the second term above we used the fact that ity(w) = 1. Since 
tt(/) = n Y (wf), 

IK - 7T|| TV = SUp - (r/ n (/) - 7r(/)) 

I/I<1 2 

< - sup E7ry n (u;/) + - sup E (xY,n(w)9 n (f) 



I/I<1 



I/I<1 



< \w\oo sup E7rV n (/) + ~ sup E 7ry n (u;) ( n (/) - ir Y (wf) 



l/l<i 



l/l<i 



(11) 



By Cauchy-Schwarz inequality, 

7Tyn(w) - 7Ty(w/) 

1/2 



sup E 

l/l<i 



< 



E (7ry >n («;))' 



x sup 

I/I<1 



E [On(f)-Mwf) 



.1/2 



(12) 
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The first term is bounded by ju^oo sup|j|<; 1 [IE(7ry jn 

(Z)) 2 ]V 2 - For t j ie seconc i 

term, observe that 



E (e n (f)-iry(wf) 



<2E fl n (/)-VW) +2K(n Y)n (wf)-7r Y (wf))\ (13) 



and 



(1 - Ty, n H) 2 ^(/) I < E (Try (id) - 9 Y , n (w)f 



< \w\^ SU P E(7?y, n (5)) 2 , (14) 

lsl<l 



E 



and the above calculation holds for all / : |/| < 1. Combining (fTT|) . (fl~2j) . (fl~3j) 
and (]14p yields the desired result. □ 



Proof of Theorem [/J We recall that r n * is the last time k before n that the 
main chain is sampled from 6k-i- Now, we can write 

||£x n -7r|| TV = supi|E/(X„)-7r(/)| 



l/l<i 



sup - 

I/I<1 2 
1 

SU P o 

I/I<1 2 



J2Hf(Xn),T n *=k)-n(f) 

k=0 
n 

J>(r n . = k)\E(f(X n ) | r n * = k) - vr(/)] 



k=0 



n 

< ^P(r n « = fc) sup -|E(/(X n ) | r„, = k) - vr(/)|.(15) 



fc=0 



l/l<i 



Observe that the conditional distribution of X n given that r n * = k > 1, is 
7] k ^iP n ~ k (set r? = £y ). Then, 

sup l\E(f(X n ) | r n * = A:) - tt(/)| = sup i^P-^/) - vr(/)| 



l/l<i 



l/l<i 



TV 
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By the fact that irP = it, we have ||r/fc_iP n k — vr|| TV < ||%_i — tt|| tv < 
B k _i, by Lemma [TJ Also P(r n . = k) = e(l — e) n ~ k for k = 1, . . . , n and 
P(r n » = 0) = (1 - e) n . Thus, ([T5j) becomes ©. 

To establish ([9]), we show that the partial sum Y^k=i (fi-^-k) ~ ^(f)) 
admits a well behaved martingale approximation. For a probability measure 
9 on X, define 

oo 

7 r e (A) = eY,(l-t) j (0P i )(A), A e B. 

j=o 

Clearly, ng is a probability measure on (X,B), and in fact we have for all 
A e B, 

oo „ 

ir e P e (A) = e^(l - ey I (9Pi)(dz) ((1 - e)P(z, A) + e9(A)) 

3=0 

oo 

= e^(l - ey +1 (9Pi +1 )(A) + eO(A) = ir (A). 

3=0 

This means that the kernel Pg is invariant by irg. It is also easy to check by 
induction that for any bounded measurable function /, and n > 1, 

oo 

- Mf) = (1 " e) n P n f(x) - ej> - (16) 
It then follows that 

\\P£(x,-)-ir e \\ Ty <2(l-er. (17) 
As a result of (fT7|) . the function 

oo 
3=0 

is well-defined with l^loo < 2e~ 1 |/| 00 , and satisfies Poisson's equation 

ge(x)-P e ge(x) = f(x)-TTe{f), xgX. (18) 
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In particular, we have f(X k ) - ^Q kl (f) = 9g kl {X k ) - P^ kl 9Q kl {X k ), al- 
most surely. Using this, we write: 

n n n 

E (/(**) - -(/)) = E (n k Jf) - <fj) + E (/(**) - *o k Jf) 

k=l k=l k=l 

n n 

= E fajf) - -(/)) + E - 
^=1 fc=i 

n n 

+E K-Av^-i) - W x *))+E (w**) - V^tf 



^=1 



/c=i 



Using (fTHj) . the definition of and (fT6|) . it is easy to check that for any 
probability measures 9,9' and x £ X, 



x) - Pe>9e> 0) 



J(9'-9)(dz) Lpj(l-eypif(z)j 



this implies that the term YHt=i i^e^eS^^ ~ \®0k S^ k ^) m a 
scoping sum and we have 



(0»-0o) U^j(i-W 



3=0 



The term X^fc=i ( ~~ ^6^9^^^) lS a ^ SO a telescoping sum 
and we have 



fc=i 

From the definition of 7T0, notice that we can write 

n n 



fc=i 
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where f e {x) = £^2JLo(^ ~ e) 3 P J f(x). Thus 

< fElE 1/2 ^i(/e-vr(/ e ))) <|/| 2 oo(EV^ 



E 



.k=i 



n-1 



\fc=l / \fc=0 / 

where in the last equality, we use the fact that sup|j| oo<1 E#|(/ — vr(/)) < 
which is (|13|) and is proved as part of Lemma [TJ 

Finally we also notice that Y2=i ( x k) ~ P e fc _ 1 %_ 1 (^fe-i) 

X]fc=i ^fe is a martingale with respect to {T n }, whence 



2 n 



\k = l / fc = l 

Using all the above, we obtain ([9]). □ 
2.3 An example on the lower bound 

We provide an example where 0(n~ x ) is also the lower bound of the rate for 
both \\r] n — 7r|| TV and \\£x n ~ ^IItv This shows that the rate in our upper 
bound in Corollary Q] is optimal. 

Example 1. Consider the simple case when X = {±1}, and ir = ixy . In 
this case, the weight function is uniform (w = 1). Suppose the auxiliary 
chain {l^,} n >o has transition matrix 

Py=( l ~ a x a _ h \ with a, b € (0,1). 

The corresponding Markov chain has stationary distribution iry = (a + 
b)~ l (b,a) and eigenvalues Ai = 1, A2 = 1 — a — b. Suppose a + b ^ 1 and 
the chain starts at Yq = — 1. By straight-forward calculation, W(Y n = — 1) = 
a/ (a + b) + b/(a + b)\%. Then, 



E7?y in ({-1}) - 7Tr({-l}) 



1 n 



a 1 A2 — A 2 



n+1 



n *r~i a + bn 1 — A2 
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It then follows from the definition that Hr/^ — ^Htv 

> C/n. 

Furthermore, in ([2]) set P(x,-) = vr(-). That is, P is the best kernel we 
can put into the algorithm, in the sense that it takes one step to arrive at 
the stationary distribution (although this is too ideal to be practical). Now, 

F(X n = -1) - tt({-1}) = (l-e)7r({-l})+eE^y in ({-l})-7r({-l}) 

= e (E7ry jn ({-l})-7Ty({-l})) • 

It then follows that \\£x n — vr|| TV > C/n. 
2.4 Multiple IRMCMC 

We discuss importance-resampling MCMC algorithm in forms of multiple 
chains and establish a similar convergence rate as in Section 12.11 



Algorithm 2 (Multiple IRMCMC). We construct iteratively m discrete- 
time stochastic processes 

n>oA = 0, ...,m as follows. Fix 
e £ (0, 1). Let X(°> be a Markov chain with target distribution ttq starting 
at xq. Then iteratively, for each I = l,...,m with X^~^ constructed, 
design X^> starting from xg, so that X® an d X^ interact as the main 
chain and the auxiliary chain respectively in Algorithm [TJ Namely, let Pi 

(£) 

be a Markov kernel with stationary distribution 7T£, and sample X^ {-^ from 
P t ^-v>{X^,-) with 

PifiM = (l-e)Pt(x r ) + e9(-) 

and 

with Wi(x) = ir£(x)/ir£-i(x),x G X . Note that the £-th chain X^ at time n, 

depends on {A^ ) } fc=0i ..., n -i^=i,...m-i- We assume that max^i,...^ \we\oo < 
oo. 

In view of Theorem [H it suffices to control 

BCO d = sup m xW n (/) + sup E (n xW n (f)) 2 , n G N, (19) 
l/l<i l/l<i v ' 

where this time n X (e) n (f) d = t^xW n(f) ~ ^e-iif)- I n fact, it suffices to 
control Bn^ , which is the purpose of the following assumption. 
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A2 As n — > oo, the initial Markov chain {Xn^} n >o satisfies Bn^ < C/n. 



Theorem 2. Consider the multiple IRMCMC (Algorithm^) for which As- 
sumption \(A2)\ holds and max^ = i v .. im \we\oc < oo. Then for I = 1, . . . , m, 
there exists a finite constant C such that 



TV n 



and for any bounded measurable function f , 



E 



V 1=1 



< c. 



(20) 



(21) 



Proof. Simply observe that (I20p and (121j) imply that B$ < C/n, asn — > oo. 
By Theorem [H this implies in turn that (|20[) and (|2ip hold with i replaced 
by I + 1. Given [(A2)| the result follows by induction. □ 



3 Equi-Energy Sampler 

In this section, we consider the simplified EE sampler as follows. Recall 
that the auxiliary chain {Y n } n >Q evolves independently from the main chain 

{Xn} n >Q. 

Algorithm 3 (Equi-Energy sampler). Fix e £ (0,1). Start Xq = xq and 
^0 = 2/0- At each round n, generate 

J P(X n _!,-) w.p. 1-e 

\ K e n J X n-i,') w.p.e 

where # n = 7Ty jn is the empirical measure associated to {Y n } n >o and Kg is 
defined by 

K e (x, A) = l A (x) + [ ( 1 A _ i^)) 9 (dz). 

In other words, for all non-negative functions h : X — > R and n E N, 

E x (/i(X n+i ) I F n ) = P s h(X n ) almost surely, (22) 

where for any probability measure on X, Pg is defined as 

Pg(x, A) = (1 - e)P(x, A) + ei^(x, A), (23) 

Recall that we write ir(dx) = 7r(x)dx and similarly for Tiy with a little abuse 
of language, and w{x) = 7r(x)/7Ty(x). We assume \w\oo < oo. 
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The kernel Kg t is the Independent Metropolis kernel with target ir and 
proposal 9* = ny- It is well known that under the assumption \w\oo < oo 
(recall Remark[T]), the kernel Kg t is uniformly ergodic and this property 
is inherited by Pg^. That is, there exists p £ (0, 1), such that 

\\Pl(x,-)-Tr(-)\\ Ty < Cp n , n>0. (24) 

3.1 Convergence rate of EE sampler 

We make the following assumptions. 

A3 There exist a finite universal constant C such that for any measurable 
function / : X — > R, with |/|oo < lj 



supP 



1 n 



n 

3=1 



> x < C exp 



x 2 



Ca 2 (fj) ' 



where cr 2 (/) = f f x f 2 (x)-KY (dx) 



A4 The function w : X — > R is continuous (with respect to the metric on 
X), and 

sup 4Sr < °°> ( 25 ) 

where 0(x) = f tty {{z : w(z) < w(x)}). 

A5 The kernel P is such that if / : X — > R is continuous, then Pf is also 
continuous. 

Remark 6. Deviation bounds as in | (A3) | are available for various conditions 
on transition kernels. See for example Cattiaux and Guillin 0, Proposition 
1.2]. 

Remark 7. Assumption [(A4) is not restrictive. For example, consider X 



and Try = 7r with some T £ (0, 1). For the sake of simplicity, we focus 

on x € R+ and define 4>+(x) = f tty({z > : w(z) < w(x)}). Suppose the 
density -ir(x) decays asymptotically as x~ a for a > 1 as x — > oo. Then, 
tty(x) ~ x~ Ta and w(x) ~ x {T-i)a _ jj ere anc j k e i OW) we wr ite a(x) ~ &(x) 
if lim^-j.oo a(x)/6(x) = 1. Assume further that Ta > 1. Then, (f>+(x) ~ 
(Ta - l)-V" Ta and 



1 



w 2 {x) Ta — 1 
Therefore, ([25]) holds, if T > (1 + 2a)/(3a) 



3 ,l+2a-3Ta 
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Theorem 3. Consider the Equi-Energy sampler described as above and sup- 
pose that Assumptions (A3)- (A5) hold. Then, there exists a constant C, 
such that for all continuous functions f : X — )■ R and n6tf, 



|E(/(X n )-7T(/))| < 



C\f\c 



11 



(26) 



Proof. Fix n > 2 and 1 < q < n. Fix / : X — > K with |/|oo = 1- Then write 



E x f(X n )-Plf(x) = E x (Pr q f(X q ) - P?J(x))-E x [Pr q f(X q ) - f(X n ) ■ 



< C P n ~ 



For the first term we can use (|24p to get: 

E x (p^f(X q )-Plf(x) 



for some finite constant C that does not depend on /. For the second term, 
we write: 



E x [p r e l -y{x q )-f{x n 



E, 



n-l 



n-l 



i+i, 



3=1 



5> [pz j f(xj 



E x R 



x \ 9+ 



jn-j-1 



f(x : 



ft 



3=1 
n-l 



3=1 
n-l 



3=1 

where in the last line we write 

7 n-j-l 



(27) 



_ Pe~ (/») " #*(/)) 



with Co and p chosen as in (|24p. As a consequence of (|24p . |Cnj|oo < 1- It 
is also continuous by the continuity of / and Assumption |(A5) 
To simplify the notation, for any function g : X — > K, define 



H g (x, z) = f a(x, z) (g{z) - g(x)) , x, z e X, 



(28) 
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where 



Thus, we can write 



, \ dcf W(z) 

a(x, z) = 1 A . . 

w(x) 



(29) 



P e g(x) - P e ^g{x) = e J H g (x, z)(6(dz) - 0*(ds)). 

Always based on g : X — > R, we introduce the class of functions 

•^5 = f {z !->■ i?g(a;, z) : x G Af} , 
and the empirical process 

Therefore, the expectation term in (|27p becomes 
E, [(p 0i - Pg-.) ^i(^)] = eE x J H^iXj^Xe^dz) - Ojidz)) 



whence 



n ^ ^ n— j — 1 



We prove in Lemma [2] below that for any continuous function g : X — > 
such that Igloo < 1, 

E x f sup \G n (h)\) < C, 
\heT g J 

for some constant C that does not depend on n nor g. We conclude that 



E, 



n—l 1 

(p^/(I,)-/(I„))|<^i 



n-j-l 



3=1 
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Thus for any 1 < q < n, 

\E x (f(X n ))-9*(f)\<C{ 



by choosing q = n- [ ■ 



n—l n _j_i 

p n + p n- g + e J- P —— \ < Cn-V\ 



21ogp. 

□ 



We rely on the following technical result on the auxiliary chain {Y n } n >Q. 

Lemma 2. Suppose that Assumptions \(A3)\ and \(A4~j\ hold, and let g : X — ?■ 

K be continuous such that \g\oo < 1- Then 



E x sup \G n (h)\ < C, 



for a constant C that does not depend on n. 

Proof. Throughout the proof n > 1 is fixed. Assumption |(A3)| suggests the 
following metric on T g : 

d(h 1 ,h 2 ) = a(h 1 -h 2 )= [ I (hi(x) - h 2 {x)f vry(dx) 



which has the following properties. For xi,x 2 G X, it is easy to check that 
\H g (xi,z) - H g (x 2 ,z)\ < 2 \a(xi,z) - a(x 2 ,z)\ + \g(xi) - g(x 2 )\ . (30) 
It follows that 

d(H g (x lr ),H g (x 2 ,-)) 



<V2\g(x 1 )-g(x 2 )\ + 2V2J / \a(x u z) - a(x 2 , zf ir Y (dz). (31) 



This implies that the diameter of JF g is bounded by 6(J- g ) = 4y/2. It also 
implies that with respect to d, the empirical process {G n (h), h £ J- g } 
is separable. Indeed, for x G X arbitrary and h = H g (x,-), using 
the Polish assumption, we can find a sequence x m G X (x m belongs 
to a countable subset of X) such that x m — > x, as m — > oo. Set- 
ting h m = H g (x m ,-), it follows from (|3T|) and the continuity of u and 
E that h m — > h in (J-, d), and (|30p easily show that G n (h m ) — G n (h) = 
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^ 1/2 E"=l (H g (x, Y t ) - H g (x m ,Y e )) + ^Ett y (H 9 (x, •) - H g (x m , •)) ->• as 
m — > oo for all realizations of {Yi, . . . , Y n }. 

For any hi,fi2 G Assumption |(A3)| implies that for any x > 



(|G n (/ii 



^n(^2)| > a?) < C*exp 



cd 2 (h 1 ,h 2 ) 



Then we apply van der Vaart and Wellner [23l . Corollary 2.2.8] to conclude 
that for ho £ JF g , there exists a constant C independent of g, such that 



E x sup \G n (h)\ <E x \G n (ho)\ +C 



1 + logD(e, Jg,d) de < oo, 



where D(e, J- g , d) is the packing number of T g with respect to d. Assumption 
|(A3)| shows that E x |G n (/io)| < oo. To control the entropy number, we further 
bound the right hand of (j31|) . 

Without loss of generality, assume x\,x 2 G X and w(x\) < w{x2). If 
w(xi) V w(x 2 ) < w(z), then a(xi, z) — a(x2, z) = 0. If w(z) < w(xi), then 



\a(xi,z) - a(x 2 , z)[ 



w(z) w(z) 



w(xi) w(x 2 ) 
If w(xi) < w(z) < w(x2), then 



< 



(w(x 2 ) - w(xi)) 2 . 



\a(xi, z) - a(x 2 ,z)\ = 
Thus 

|a(xi, z) — a(x2, z)\ 2 ny (dz) 
<j)(xi) 4>(x 2 ) 



w(z) 



w{x 2 



< 



W{X 2 



(w(x 2 ) - w(xi)Y 



< 



+ 



w{x 2 ) 



(w(x 2 ) - w(xi)) 2 < C {w(x 2 ) - w(xi)) 2 , 



clef 



where 4>(x) = Try ({z : w(z) < w(x)}), and the last inequality follows from 
pl)| Together with we conclude from this bound that 



d (H g ( Xl , -),H g (x2, •)) < C(\g( Xl ) - g(x 2 )\ + \w(x 2 ) - w( Xl )\). (32) 

Since \g\oo < 1 an d w{x) G [0, \w \ QO \, this implies that the e-packing number 
of (F g ,6) is at most of order e~ 2 , so that f 3 ^ \/l + log D(e, F g ,6) de < 
C Jq^'^ \/l + log(l/e)de < oo. This proves the lemma. □ 
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3.2 Connection with Parallel Tempering 

Our results suggest that the EE sampler mixes relatively slowly. A plausible 
reason for this slow mixing is the dependence on the entire sample path 
{Yfc}o<fc< n - The EE sampler is closely related to the Parallel Tempering 



(PT) algorithm of [121 ]. which suggests that it might be possible to exploit 
this connection by deriving versions of the EE sampler with better mixing 
properties. Like the EE sampler, a 2-chain PT generates a stochastic process 
{(X n , y n )} n >o where with probability 1 — e, X n is generated from P(X n _i, ■) 
and with probability e, one proposes to swap the two chains. Thus PT is 
closely related to a EE-type algorithm where the empirical measure TXy.n 
would be replaced by a Dirac measure 5y n ■ However, we show that in general, 
this new algorithm does not maintain the correct stationary distribution. We 
hope that the discussion in this section would be helpful in conceptualizing 
new adaptive algorithms in the future. 

The modified EE sampler is as follows. Let {Y n } n >Q be the auxil- 
iary chain with transition kernel Py and stationary distribution ny. Let 
{X n } n >o be a chain satisfying the following assumption: for all continuous 
and bounded function /, 

E[f(X n+1 ) | X , . . . , X n , Y , . . . , Y n ] = P SY J(X n ),n G N, 

where P$ is as in (|23h . and denote the stationary distribution of P by ttx- 



Remark 8. The difference from the EE sampler is that we replace 7?y in by 
5y n . If, when X n+ \ is moving to Y n , we also make Y n+ i move to X n , then 
we are allowing swaps between the two chains. Such swaps are in the spirit 
of parallel tempering algorithms (see e.g. Geyer 

A nice property of this algorithm is that {(X n , l^)} n >o is a Markov 
chain. Indeed, it has transition kernel 

Px,y{x,y,dz,dw) = P Sy (x,dz)P Y (y,dw) . (33) 

This Markov chain may not have the desired stationary distribution. Let 
ttx,y denote the stationary distribution and let tt x y ,i = 1,2 denote its two 

marginal distributions. Naturally, we wish = an d = 7TY ' ^ 
construction, the latter identity is always true. However, the former does 
not always hold. Since Pg = (1 — e)P + eKg and P(x, dz)Py (y, dw) has 
stationary distribution Trx(dz)iry (dw), instead of (j33|) it suffices to focus on 
the transition kernel 

Px,Y (x,y,dz, dw) = K Sy (x, dz)Py (y, dw) . (34) 



21 



Consider the simple case when both chains take values from {±1}. Let 
the auxiliary chain has the following transition matrix and stationary dis- 
tribution: 

PY= { 1 ~b a 1-6 ) and ^ = (^'^T6 
Recall that in this case, 

K Sy {x, z) = a(x, y)l{ z=y } + (1 - a(x, y))l{ z=x } , 

with 



a(x, y) = 1 A 



7rx(y) Try (a;) 



TT X (x) 7Ty(y) ' 

Write c = a(l, —1) and d = a{— 1, 1). Then, one can write the transition 
matrix of Px y as follows: 





(1,1) 


(1,-1) 


("1 


1) 


(-1,-1) 


(1,1) 


1 - a 


a 










(1,-1) 


(1 " c)b 


(l-c)(l-6) 


do 




c(l - 6) 


(-1,1) 


d(l - a) 


da 


(1- 


d)(l-a) 


(1 - d)a 


(-1,-1) 








b 




1-6 



For example, the table reads as 

P(l, -1,1,-1) =Pi_ 1 (l,l)iV(-l,-l) = (l-c)(l-6) 
We solve irx yPx Y = Y and obtain 



1 



TTXY 



(a + &)((!- a - 6)cd + ac + 6d) 



/ 6(2(6+ (1- a- 6)c) \ 
abd 
abc 

\ ac(a + (1 - a - 6)d) / 



To see that vr^y does not always equal irx, take for example a = b = 1/3 
and 7Tx = (2/3,1/3). In this case, c = 1/2, d = 1, and 7rx,y = 
(3/8, 1/4, 1/8, 1/4), whence 7rJ y = (5/8, 3/8) ^ tt X - 
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